Export 4CAT datasets and analyses as ZIP file... and import them elsewhere! #452

dale-wahl · 2024-09-06T14:03:53Z

Processor that saves metadata, results files, and logs into a zip file.
- Automatically expires after 1 day (as these datasets are essentially just copies of the other data)
Updates to import_4cat datasource to accept ZIP files in addition to 4CAT results URLs

had some weird lost datasets when debugging this

commit 3f2a62a Author: Carsten Schnober <[email protected]> Date: Wed Sep 18 18:18:29 2024 +0200 Update Gensim to >=4.3.3, <4.4.0 (#450) * Update Gensim to >=4.3.3, <4.4.0 * update nltk as well --------- Co-authored-by: Dale Wahl <[email protected]> Co-authored-by: Sal Hagen <[email protected]> commit fee2c8c Merge: 3d94b66 f8e93ed Author: sal-phd-desktop <[email protected]> Date: Wed Sep 18 18:11:19 2024 +0200 Merge branch 'master' of https://github.com/digitalmethodsinitiative/4cat commit 3d94b66 Author: sal-phd-desktop <[email protected]> Date: Wed Sep 18 18:11:04 2024 +0200 FINALLY remove 'News' from the front page, replace with 4CAT BlueSky updates and potential information about the specific server (to be set on config page) commit f8e93ed Author: Stijn Peeters <[email protected]> Date: Wed Sep 18 15:11:21 2024 +0200 Simple extensions page in Control Panel commit b5be128 Author: Stijn Peeters <[email protected]> Date: Wed Sep 18 14:08:13 2024 +0200 Remove 'docs' directory commit 1e2010a Author: Stijn Peeters <[email protected]> Date: Wed Sep 18 14:07:38 2024 +0200 Forgot TikTok and Douyin commit c757dd5 Author: Stijn Peeters <[email protected]> Date: Wed Sep 18 14:01:31 2024 +0200 Say 'zeeschuimer' instead of 'extension' to avoid confusion with 4CAT extensions commit ee7f434 Author: Stijn Peeters <[email protected]> Date: Wed Sep 18 14:00:40 2024 +0200 RIP Parler data source commit 11300f2 Author: Stijn Peeters <[email protected]> Date: Wed Sep 18 11:21:37 2024 +0200 Tuplestring commit 5472652 Author: Stijn Peeters <[email protected]> Date: Wed Sep 18 11:15:29 2024 +0200 Pass user obj instead of str to ConfigWrapper in Processor commit b21866d Author: Stijn Peeters <[email protected]> Date: Tue Sep 17 17:45:01 2024 +0200 Ensure request-aware config reader in user object when using config wrapper commit bbe79e4 Author: Sal Hagen <[email protected]> Date: Tue Sep 17 15:12:46 2024 +0200 Fix extension path walk for Windows commit d6064be Author: Stijn Peeters <[email protected]> Date: Mon Sep 16 14:50:45 2024 +0200 Allow tags that have no users Use case: tag-based frontend differentiation using X-4CAT-Config-Via-Proxy commit b542ded Author: Stijn Peeters <[email protected]> Date: Mon Sep 16 14:13:14 2024 +0200 Trailing slash in query results list commit a4bddae Author: Dale Wahl <[email protected]> Date: Mon Sep 16 13:57:23 2024 +0200 4CAT Extension - easy(ier) adding of new datasources/processors that can be mainted seperately from 4CAT base code (#451) * domain only * fix reference * try and collect links with selenium * update column_filter to find multiple matches * fix up the normal url_scraper datasource * ensure all selenium links are strings for join * change output of url_scraper to ndjson with map_items * missed key/index change * update web archive to use json and map to 4CAT * fix no text found * and none on scraped_links * check key first * fix up web_archive error reporting * handle None type for error * record web archive "bad request" * add wait after redirect movement * increase waittime for redirects * add processor for trackers * dict to list for addition * allow both newline and comma seperated links * attempt to scrape iframes as seperate pages * Fixes for selenium scraper to work with config database * installation of packages, geckodriver, and firefox if selenium enabled * update install instructions * fix merge error * fix dropped function * have to be kidding me * add note; setup requires docker... need to think about IF this will ever be installed without Docker * seperate selenium class into wrapper and Search class so wrapper can be used in processors! * add screenshots; add firefox extension support * update selenium definitions * regex for extracting urls from strings * screenshots processor; extract urls from text and takes screenshots * Allow producing zip files from data sources * import time * pick better default * test screenshot datasource * validate all params * fix enable extension * haha break out of while loop * count my items * whoops, len() is important here * must be getting tired... * remove redundant logging * Eager loading for screenshots, viewport options, etc * Woops, wrong folder * Fix label shortening * Just 'queue' instead of 'search queue' * Yeah, make it headless * README -> DESCRIPTION * h1 -> h2 * Actually just have no header * Use proper filename for downloaded files * Configure whether to offer pseudonymisation etc * Tweak descriptions * fix log missing data * add columns to post_topic_matrix * fix breadcrumb bug * Add top topics column * Fix selenium config install parameter (Docker uses this/manual would need to run install_selenium, well, manually) * this processor is slow; i thought it was broken long before it updated! * refactor detect_trackers as conversion processor not filter * add geckodriver executable to docker install * Auto-configure webdrivers if available in PATH * update screenshots to act as image-downloader and benefit from processors * fix is_compatible_with * Delete helper-scripts/migrate/migrate-1.30-1.31.py * fix embeddings is_compatible_with * fix up UI options for hashing and private * abstract was moved to lib * various fixes to selenium based datasources * processors not compatible with image datasets * update firefox extension handling * screenshots datasource fix get_options * rename screenshots processor to be detected as image dataset * add monthly and weekly frequencies to wayback machine datasource * wayback ds: fix fail if all attempts do not realize results; addion frequency options to options; add daily * add scroll down page to allow lazy loading for entire page screenshots * screenshots: adjust pause time so it can be used to force a wait for images to load I have not successfully come up with or found a way to wait for all images to load; document.readyState == 'complete' does not function in this way on certain sites including the wayback machine * hash URLs to create filenames * remove log * add setting to toggle display advanced options * add progress bars * web archive fix query validation * count subpages in progress * remove overwritten function * move http response to own column * special filenames * add timestamps to all screenshots * restart selenium on failure * new build have selenium * process urls after start (keep original query parameters) * undo default firefox * quick max * rename SeleniumScraper to SeleniumSearch todo: build SeleniumProcessor! * max number screenshots configurable * method to get url with error handling * use get_with_error_handling * d'oh, screenshot processor needs to quit selenium * update log to contain URL * Update scrolling to use Page down key if necessary * improve logs * update image_category_wall as screenshot datasource does not have category column; this is not ideal and ought to be solved in another way. Also, could I get categories from the metadata? That's... ugh. * no category, no processor * str errors * screenshots: dismiss alerts when checking ready state is complete * set screenshot timeout to 30 seconds * update gensim package * screenshots: move processor interrupt into attempts loop * if alert disappears before we can dismiss it... * selenium specific logger * do not switch window when no alert found on dismiss * extract wait for page to load to selenium class * improve descriptions of screenshot options * remove unused line * treat timeouts differently from other errors these are more likely due to an issue with the website in question * debug if requested * increase pause time * restart browser w/ PID * increase max_workers for selenium this is by individual worker class not for all selenium classes... so you can really crank them out if desired * quick fix restart by pid * avoid bad urls * missing bracket & attempt to fix-missing dependencies in Docker install * Allow dynamic form options in processors * Allow 'requires' on data source options as well * Handle list values with requires * basic processor for apple store; setup checks for additional requirements * fix is_4cat_class * show preview when no map_item * add google store datasource * Docker setup.py use extensions * Wider support for file upload in processors * Log file uploads in DMI service manager * add map_item methods and record more data per item need additional item data as map_item is staticmethod * update from master; merge conflicts * fix docker build context (ignore data files) * fix option requirements * apple store fix: list still tries to get query * apple & google stores fix up item mapping * missed merge error * minor fix * remove unused import * fix datasources w/ files frontend error * fix error w/ datasources having file option * better way to name docker volumes * update two other docker compose files * fix docker-compose ymls * minor bug: fix and add warning; fix no results fail * update apple field names to better match interface * update google store fieldnames and order * sneak in jinja logger if needed * fix fourcat.js handling checkboxes for dynamic settings * add new endpoint for app details to apple store * apple_store map new beta app data * add default lang/country * not all apps have advisories * revert so button works * add chart positions to beta map items * basic scheduler To-do - fix up and add options to scheduler view (e.g. delete/change) - add scheduler view to navigator - tie jobs to datasets? (either in scheduler view or, perhaps, filter dataset view) - more testing... * update scheduler view, add functions to update job interval * revert .env * working scheduler! * basic scheduler view w/ datasets * fix postgres tag * update job status in scheduled_jobs table * fix timestamp; end_date needed for last run check; add dataset label * improve scheduler view * remove dataset from scheduled_jobs table on delete * scheduler view order by last creation * scheduler views: separate scheduler list from scheduled dataset list * additional update from master fixes * apple_store map_items fix missing locales * add back depth for pagination * correct route * modify pagination to accept args * pagination fun * pagination: i hate testing on live servers... * ok ok need the pagination route * pagination: add route_args * fix up scheduler header * improve app store descriptions * add azure store * fix azure links * azure_store: add category search * azure fix type of config update timestamp OPTION_DATE does not appear correctly in settings and causes it to be written incorrectly * basic aws store * check if selenium available; get correct app_id * aws: implement pagination * add logging; wait for elements to load after next page; attempts to rework filter option collection * apple_store: handle invalid param error * fix filter_options * aws: fix filter option collection! * more merge * move new datasources and processors to extensions and modify setup.py and module loader to use the new locations * migrate.py to run extension "fourcat_install.py" files * formatting * remove extensions; add gitignore * excise scheduler merge * some additional cleanup from app_studies branch * allow nested datasources folders; ignore files in extensions main folder * allow extension install scripts to run pip if migrate.py has not * Remove unused URL functions we could use ural for * Take care of git commit hash tracking for extension processors * Get rid of unused path.versionfile config setting * Add extensions README * Squashed commit of the following: commit cd356f7 Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 17:36:18 2024 +0200 UI setting for 4CAT install ad in login commit 0945d8c Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 17:32:55 2024 +0200 UI setting for anonymisation controls Todo: make per-datasource commit 1a2562c Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 15:53:27 2024 +0200 Debug panel for HTTP headers in control panel commit 203314e Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 15:53:17 2024 +0200 Preview for HTML datasets commit 48c20c2 Author: Desktop Sal <[email protected]> Date: Wed Sep 11 13:54:23 2024 +0200 Remove spacy processors (linguistic extractor, get nouns, get entities) and remove dependencies commit 657ffd7 Author: Dale Wahl <[email protected]> Date: Fri Sep 6 16:29:19 2024 +0200 fix nltk where it matters commit 2ef5c80 Author: Stijn Peeters <[email protected]> Date: Tue Sep 3 12:05:14 2024 +0200 Actually check progress in text annotator commit 693960f Author: Stijn Peeters <[email protected]> Date: Mon Sep 2 18:03:18 2024 +0200 Add processor for stormtrooper DMI service commit 6ae964a Author: Stijn Peeters <[email protected]> Date: Fri Aug 30 17:31:37 2024 +0200 Fix reference to old stopwords list in neologisms preset * Fix Github links for extensions * Fix commit detection in extensions * Fix extension detection in module loader * Follow symlinks when loading extensions Probably not uncommon to have a checked out repo somewhere to then symlink into the extensions dir * Make queue message on create page more generic * Markdown in datasource option tooltips * Remove Spacy model from requirements * Add software_source to database SQL --------- Co-authored-by: Stijn Peeters <[email protected]> Co-authored-by: Stijn Peeters <[email protected]> commit cd356f7 Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 17:36:18 2024 +0200 UI setting for 4CAT install ad in login commit 0945d8c Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 17:32:55 2024 +0200 UI setting for anonymisation controls Todo: make per-datasource commit 1a2562c Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 15:53:27 2024 +0200 Debug panel for HTTP headers in control panel commit 203314e Author: Stijn Peeters <[email protected]> Date: Sat Sep 14 15:53:17 2024 +0200 Preview for HTML datasets commit 48c20c2 Author: Desktop Sal <[email protected]> Date: Wed Sep 11 13:54:23 2024 +0200 Remove spacy processors (linguistic extractor, get nouns, get entities) and remove dependencies commit 657ffd7 Author: Dale Wahl <[email protected]> Date: Fri Sep 6 16:29:19 2024 +0200 fix nltk where it matters

not sure pycharm's merge is super awesome...

dale-wahl added 5 commits August 19, 2024 14:22

export processor

f196be2

start of importer

e23f925

finish off importing ZIP 4CAT datasets

358aca2

ensure cleanup on failure

53f76dc

had some weird lost datasets when debugging this

auto-expire export zips

f944c71

dale-wahl requested a review from stijn-uva September 6, 2024 14:04

dale-wahl and others added 9 commits September 6, 2024 16:09

Merge branch 'master' into dataset_export_import

db151bb

nltk again

abd9b11

merge docker files

bc5dd4e

Merge branch 'master' into dataset_export_import

b2a0be5

fix merge issues

56e3b7f

more modules passing fixes

0000c4c

disappearing import

a8e6aff

not sure pycharm's merge is super awesome...

fix import 4cat datasource with modules changes

3793ee9

stijn-uva merged commit a224dd9 into master Oct 1, 2024
1 check passed

dale-wahl deleted the dataset_export_import branch October 1, 2024 14:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Export 4CAT datasets and analyses as ZIP file... and import them elsewhere! #452

Export 4CAT datasets and analyses as ZIP file... and import them elsewhere! #452

dale-wahl commented Sep 6, 2024

Export 4CAT datasets and analyses as ZIP file... and import them elsewhere! #452

Export 4CAT datasets and analyses as ZIP file... and import them elsewhere! #452

Conversation

dale-wahl commented Sep 6, 2024